Chicago Crimes EDA and Prediction
Table of Contents
Configuration:
RAM: 16 GB
CPU: Intel Core i7-8750H @ 2.20 GHz
Graphics card: GTX 1060 (desktop)
# pip install missingno
# pip install catboost
# pip install lightgbm
# pip install plotly
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier
import folium
from folium.plugins import HeatMap
import plotly.express as px
plt.style.use('fivethirtyeight')
%matplotlib inline
pd.set_option('display.max_columns', 32)
# reading data
df = pd.read_csv('Chicago_Crimes_DataSet/Chicago_Crimes_2012_to_2017.csv')
df.head()
| Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | 1022 | 10.0 | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 2016 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) |
| 1 | 89 | 10508695 | HZ250409 | 05/03/2016 09:40:00 PM | 061XX S DREXEL AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | 313 | 3.0 | 20.0 | 42.0 | 08B | 1183066.0 | 1864330.0 | 2016 | 05/10/2016 03:56:50 PM | 41.782922 | -87.604363 | (41.782921527, -87.60436317) |
| 2 | 197 | 10508697 | HZ250503 | 05/03/2016 11:31:00 PM | 053XX W CHICAGO AVE | 0470 | PUBLIC PEACE VIOLATION | RECKLESS CONDUCT | STREET | False | False | 1524 | 15.0 | 37.0 | 25.0 | 24 | 1140789.0 | 1904819.0 | 2016 | 05/10/2016 03:56:50 PM | 41.894908 | -87.758372 | (41.894908283, -87.758371958) |
| 3 | 673 | 10508698 | HZ250424 | 05/03/2016 10:10:00 PM | 049XX W FULTON ST | 0460 | BATTERY | SIMPLE | SIDEWALK | False | False | 1532 | 15.0 | 28.0 | 25.0 | 08B | 1143223.0 | 1901475.0 | 2016 | 05/10/2016 03:56:50 PM | 41.885687 | -87.749516 | (41.885686845, -87.749515983) |
| 4 | 911 | 10508699 | HZ250455 | 05/03/2016 10:00:00 PM | 003XX N LOTUS AVE | 0820 | THEFT | $500 AND UNDER | RESIDENCE | False | True | 1523 | 15.0 | 28.0 | 25.0 | 06 | 1139890.0 | 1901675.0 | 2016 | 05/10/2016 03:56:50 PM | 41.886297 | -87.761751 | (41.886297242, -87.761750709) |
df.describe()
| Unnamed: 0 | ID | Beat | District | Ward | Community Area | X Coordinate | Y Coordinate | Year | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.456714e+06 | 1.456714e+06 | 1.456714e+06 | 1.456713e+06 | 1.456700e+06 | 1.456674e+06 | 1.419631e+06 | 1.419631e+06 | 1.456714e+06 | 1.419631e+06 | 1.419631e+06 |
| mean | 3.308606e+06 | 9.597550e+06 | 1.150644e+03 | 1.125920e+01 | 2.287027e+01 | 3.745632e+01 | 1.164398e+06 | 1.885523e+06 | 2.013897e+03 | 4.184147e+01 | -8.767224e+01 |
| std | 1.235350e+06 | 8.083505e+05 | 6.916466e+02 | 6.904691e+00 | 1.380589e+01 | 2.144029e+01 | 1.850835e+04 | 3.424775e+04 | 1.449584e+00 | 9.430126e-02 | 6.661726e-02 |
| min | 3.000000e+00 | 2.022400e+04 | 1.110000e+02 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.012000e+03 | 3.661945e+01 | -9.168657e+01 |
| 25% | 2.698636e+06 | 9.002709e+06 | 6.130000e+02 | 6.000000e+00 | 1.000000e+01 | 2.300000e+01 | 1.152544e+06 | 1.858762e+06 | 2.013000e+03 | 4.176787e+01 | -8.771528e+01 |
| 50% | 3.063654e+06 | 9.605776e+06 | 1.024000e+03 | 1.000000e+01 | 2.300000e+01 | 3.200000e+01 | 1.166021e+06 | 1.891502e+06 | 2.014000e+03 | 4.185797e+01 | -8.766613e+01 |
| 75% | 3.428849e+06 | 1.022577e+07 | 1.711000e+03 | 1.700000e+01 | 3.400000e+01 | 5.600000e+01 | 1.176363e+06 | 1.908713e+06 | 2.015000e+03 | 4.190529e+01 | -8.762813e+01 |
| max | 6.253474e+06 | 1.082788e+07 | 2.535000e+03 | 3.100000e+01 | 5.000000e+01 | 7.700000e+01 | 1.205119e+06 | 1.951573e+06 | 2.017000e+03 | 4.202271e+01 | -8.752453e+01 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1456714 entries, 0 to 1456713
Data columns (total 23 columns):
 #   Column                Non-Null Count    Dtype
---  ------                --------------    -----
 0   Unnamed: 0            1456714 non-null  int64
 1   ID                    1456714 non-null  int64
 2   Case Number           1456713 non-null  object
 3   Date                  1456714 non-null  object
 4   Block                 1456714 non-null  object
 5   IUCR                  1456714 non-null  object
 6   Primary Type          1456714 non-null  object
 7   Description           1456714 non-null  object
 8   Location Description  1455056 non-null  object
 9   Arrest                1456714 non-null  bool
 10  Domestic              1456714 non-null  bool
 11  Beat                  1456714 non-null  int64
 12  District              1456713 non-null  float64
 13  Ward                  1456700 non-null  float64
 14  Community Area        1456674 non-null  float64
 15  FBI Code              1456714 non-null  object
 16  X Coordinate          1419631 non-null  float64
 17  Y Coordinate          1419631 non-null  float64
 18  Year                  1456714 non-null  int64
 19  Updated On            1456714 non-null  object
 20  Latitude              1419631 non-null  float64
 21  Longitude             1419631 non-null  float64
 22  Location              1419631 non-null  object
dtypes: bool(2), float64(7), int64(4), object(10)
memory usage: 236.2+ MB
# checking for null values
null = pd.DataFrame({'Null Values' : df.isna().sum(), 'Percentage Null Values' : (df.isna().sum()) / (df.shape[0]) * (100)})
null
| Null Values | Percentage Null Values | |
|---|---|---|
| Unnamed: 0 | 0 | 0.000000 |
| ID | 0 | 0.000000 |
| Case Number | 1 | 0.000069 |
| Date | 0 | 0.000000 |
| Block | 0 | 0.000000 |
| IUCR | 0 | 0.000000 |
| Primary Type | 0 | 0.000000 |
| Description | 0 | 0.000000 |
| Location Description | 1658 | 0.113818 |
| Arrest | 0 | 0.000000 |
| Domestic | 0 | 0.000000 |
| Beat | 0 | 0.000000 |
| District | 1 | 0.000069 |
| Ward | 14 | 0.000961 |
| Community Area | 40 | 0.002746 |
| FBI Code | 0 | 0.000000 |
| X Coordinate | 37083 | 2.545661 |
| Y Coordinate | 37083 | 2.545661 |
| Year | 0 | 0.000000 |
| Updated On | 0 | 0.000000 |
| Latitude | 37083 | 2.545661 |
| Longitude | 37083 | 2.545661 |
| Location | 37083 | 2.545661 |
# filling null values with zero
df.fillna(0, inplace = True)
# visualizing null values
msno.bar(df)
plt.show()
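Filling every null with 0 also places records with missing coordinates at latitude/longitude 0, which can distort maps and distance-based models. A minimal sketch, on toy data (not the author's pipeline), of the alternative of dropping rows that lack coordinates:

```python
import pandas as pd

# toy frame mimicking the crime data's coordinate columns (hypothetical values)
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Latitude': [41.86, None, 41.78],
    'Longitude': [-87.70, None, -87.60],
})

# keep only rows with complete coordinates instead of imputing zeros
with_coords = toy.dropna(subset=['Latitude', 'Longitude'])
print(len(with_coords))  # 2 of the 3 toy rows survive
```

Since only about 2.5% of rows lack coordinates here, dropping them would cost little data.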
df
| Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | 1022 | 10.0 | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 2016 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) |
| 1 | 89 | 10508695 | HZ250409 | 05/03/2016 09:40:00 PM | 061XX S DREXEL AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | 313 | 3.0 | 20.0 | 42.0 | 08B | 1183066.0 | 1864330.0 | 2016 | 05/10/2016 03:56:50 PM | 41.782922 | -87.604363 | (41.782921527, -87.60436317) |
| 2 | 197 | 10508697 | HZ250503 | 05/03/2016 11:31:00 PM | 053XX W CHICAGO AVE | 0470 | PUBLIC PEACE VIOLATION | RECKLESS CONDUCT | STREET | False | False | 1524 | 15.0 | 37.0 | 25.0 | 24 | 1140789.0 | 1904819.0 | 2016 | 05/10/2016 03:56:50 PM | 41.894908 | -87.758372 | (41.894908283, -87.758371958) |
| 3 | 673 | 10508698 | HZ250424 | 05/03/2016 10:10:00 PM | 049XX W FULTON ST | 0460 | BATTERY | SIMPLE | SIDEWALK | False | False | 1532 | 15.0 | 28.0 | 25.0 | 08B | 1143223.0 | 1901475.0 | 2016 | 05/10/2016 03:56:50 PM | 41.885687 | -87.749516 | (41.885686845, -87.749515983) |
| 4 | 911 | 10508699 | HZ250455 | 05/03/2016 10:00:00 PM | 003XX N LOTUS AVE | 0820 | THEFT | $500 AND UNDER | RESIDENCE | False | True | 1523 | 15.0 | 28.0 | 25.0 | 06 | 1139890.0 | 1901675.0 | 2016 | 05/10/2016 03:56:50 PM | 41.886297 | -87.761751 | (41.886297242, -87.761750709) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456709 | 6250330 | 10508679 | HZ250507 | 05/03/2016 11:33:00 PM | 026XX W 23RD PL | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True | 1034 | 10.0 | 28.0 | 30.0 | 08B | 1159105.0 | 1888300.0 | 2016 | 05/10/2016 03:56:50 PM | 41.849222 | -87.691556 | (41.849222028, -87.69155551) |
| 1456710 | 6251089 | 10508680 | HZ250491 | 05/03/2016 11:30:00 PM | 073XX S HARVARD AVE | 1310 | CRIMINAL DAMAGE | TO PROPERTY | APARTMENT | True | True | 731 | 7.0 | 17.0 | 69.0 | 14 | 1175230.0 | 1856183.0 | 2016 | 05/10/2016 03:56:50 PM | 41.760744 | -87.633335 | (41.760743949, -87.63333531) |
| 1456711 | 6251349 | 10508681 | HZ250479 | 05/03/2016 12:15:00 AM | 024XX W 63RD ST | 041A | BATTERY | AGGRAVATED: HANDGUN | SIDEWALK | False | False | 825 | 8.0 | 15.0 | 66.0 | 04B | 1161027.0 | 1862810.0 | 2016 | 05/10/2016 03:56:50 PM | 41.779235 | -87.685207 | (41.779234743, -87.685207125) |
| 1456712 | 6253257 | 10508690 | HZ250370 | 05/03/2016 09:07:00 PM | 082XX S EXCHANGE AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | SIDEWALK | False | True | 423 | 4.0 | 7.0 | 46.0 | 08B | 1197261.0 | 1850727.0 | 2016 | 05/10/2016 03:56:50 PM | 41.745252 | -87.552773 | (41.745251975, -87.552773464) |
| 1456713 | 6253474 | 10508692 | HZ250517 | 05/03/2016 11:38:00 PM | 001XX E 75TH ST | 5007 | OTHER OFFENSE | OTHER WEAPONS VIOLATION | PARKING LOT/GARAGE(NON.RESID.) | True | False | 323 | 3.0 | 6.0 | 69.0 | 26 | 1178696.0 | 1855324.0 | 2016 | 05/10/2016 03:56:50 PM | 41.758309 | -87.620658 | (41.75830866, -87.620658418) |
1456714 rows × 23 columns
# group counts by year and crime type
grouped = df.groupby(['Year','Primary Type']).size()
# compute each crime type's share of crimes within each year
grouped_pct = grouped.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
# draw a 100% stacked bar chart
fig, ax = plt.subplots(figsize=(20, 10))
grouped_pct.unstack().plot(kind='bar', stacked=True, ax=ax)
# label the chart
ax.set_xlabel('Year')
ax.set_ylabel('Percentage of Crimes')
ax.set_title('Crime Types Over the Years')
# show the chart
plt.show()
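The normalization step can be sanity-checked on toy data. The `transform`-based variant below keeps the original index (avoiding the extra index level that `groupby.apply` can add in newer pandas) and makes it easy to verify that each year's shares sum to 100, assuming per-year shares are the goal:

```python
import pandas as pd

# toy data with invented years and crime types
toy = pd.DataFrame({
    'Year': [2012, 2012, 2012, 2013],
    'Primary Type': ['THEFT', 'THEFT', 'BATTERY', 'THEFT'],
})

counts = toy.groupby(['Year', 'Primary Type']).size()
# divide each (year, type) count by that year's total
pct = 100 * counts / counts.groupby(level=0).transform('sum')
print(pct.groupby(level=0).sum())  # every year sums to 100
```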
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
df1 = df  # note: this is an alias of df, not a copy
# load the GeoJSON file of community areas
chicago = gpd.read_file('Chicago_Crimes_DataSet/chicago_community_areas.geojson')
# match area_numbe in the GeoJSON with Community Area in the CSV
df1['area_numbe'] = df['Community Area'].astype(int).astype(str)
df1 = df.merge(chicago, on='area_numbe')
# count the total number of crimes in each Community Area
crime_count = df.groupby('area_numbe').size().reset_index(name='crime_count')
# merge the counts back into the GeoJSON and render the map
chicago = chicago.merge(crime_count, on='area_numbe')
ax = chicago.plot(column='crime_count', cmap='OrRd', legend=True, figsize=(15, 15))
# label each area with its crime count
for idx, row in chicago.iterrows():
    plt.annotate(text=row['crime_count'], xy=row['geometry'].centroid.coords[0], horizontalalignment='center')
# show the map
plt.axis('off')
plt.show()
Community Area No. 25, the Austin community, has the highest crime count, so extra caution is warranted there.
import pandas as pd
import geopandas as gpd
# load the dataset
df2 = df
# match area_numbe in the GeoJSON with Community Area in the CSV
chicago = gpd.read_file('Chicago_Crimes_DataSet/chicago_community_areas.geojson')
df2['area_numbe'] = df['Community Area'].astype(int).astype(str)
df2 = df2.merge(chicago, on='area_numbe')
# count total crimes and domestic-related crimes per Community Area
crime_count = df2.groupby('area_numbe').size().reset_index(name='crime_count')
domestic_count = df2.loc[df2['Domestic'] == True].groupby('area_numbe').size().reset_index(name='domestic_count')
# merge the counts into the GeoJSON and compute the share of domestic crimes
chicago = chicago.merge(crime_count, on='area_numbe')
chicago = chicago.merge(domestic_count, on='area_numbe', how='left').fillna(0)
chicago['domestic_ratio'] = chicago['domestic_count'] / chicago['crime_count'] * 100
# sort and print the top three areas
top_3 = chicago[['community', 'domestic_ratio']].sort_values('domestic_ratio', ascending=False).head(3)
print(top_3)
# render on the map
ax = chicago.plot(column='crime_count', cmap='OrRd', legend=True, figsize=(15, 15))
ax.set_title("Percentage of Domestic-Related Crimes by Community Area in Chicago")
ax.set_axis_off()
for idx, row in chicago.iterrows():
    ax.annotate(text=str(int(row['domestic_ratio'])) + '%', xy=row['geometry'].centroid.coords[0],
                horizontalalignment='center', verticalalignment='center')
      community  domestic_ratio
11  FOREST GLEN       18.088871
60    GAGE PARK       17.846542
15  IRVING PARK       17.439658
Based on the analysis of the crime data in Chicago, the community areas with the highest percentage of domestic-related crimes are Forest Glen, Gage Park, and Irving Park, with domestic ratios of 18.09%, 17.85%, and 17.44% respectively. These findings suggest that domestic violence is a significant issue in these areas and that policymakers and law enforcement agencies may need to implement targeted interventions to address this problem. Further research may be needed to investigate the underlying factors contributing to the high rates of domestic violence in these communities and to develop effective strategies for prevention and intervention.
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
df3 = df
# Convert the Date column to datetime format
df3['Date'] = pd.to_datetime(df3['Date'])
# Extract the day of the week from the Date column
df3['DayOfWeek'] = df3['Date'].dt.day_name()
# Group the crimes by day of the week and count the number of crimes on each day
crimes_by_day = df3.groupby('DayOfWeek')['ID'].count()
# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))
# Plot a pie chart of the distribution
ax.pie(crimes_by_day, labels=crimes_by_day.index, autopct='%1.1f%%')
ax.set_title('Distribution of Crimes by Day of the Week')
# Show the plot
plt.show()
# Print the percentage of crimes on each day
print('Crimes by Day of the Week:\n', crimes_by_day)
Crimes by Day of the Week:
DayOfWeek
Friday       218643
Monday       205762
Saturday     209743
Sunday       202212
Thursday     205851
Tuesday      206129
Wednesday    208374
Name: ID, dtype: int64
The distribution of crimes by day of the week is relatively even. The highest number of crimes occurs on Fridays, followed by Saturdays and Wednesdays, while Sundays have the fewest reported crimes. Because the differences between days are small, there appears to be no strong relationship between the day of the week and the occurrence of crime. Further analysis would be needed to investigate the factors behind these modest variations.
import pandas as pd
import plotly.express as px
# Load the dataset
df4 = pd.read_csv('Chicago_Crimes_DataSet/Chicago_Crimes_2012_to_2017.csv')
# Count the number of crimes by location description
crime_counts = df4['Location Description'].value_counts()
# Create a dataframe with the location descriptions and counts
crime_df = pd.DataFrame({'Location Description': crime_counts.index, 'Count': crime_counts.values})
# Calculate the percentage of crimes in each location
crime_df['Percentage'] = crime_df['Count'] / sum(crime_df['Count']) * 100
# Create the Treemap figure
fig = px.treemap(crime_df, path=['Location Description'], values='Count', color='Percentage',height=700,
color_continuous_scale='spectral', labels={'Count': 'Number of Crimes', 'Percentage': 'Percentage of Total Crimes'})
# Update the text position and font size
fig.update_traces(textposition='middle center', textfont=dict(size=10), texttemplate='%{value}<br>%{label}')
# Show the figure
fig.show()
# Sort the dataframe by percentage in descending order
crime_df = crime_df.sort_values(by='Percentage', ascending=False)
# Select only the top five rows
top_five = crime_df[['Location Description', 'Percentage']].head(5)
# Print the top five locations by percentage
print(top_five.to_string(index=False))
Location Description Percentage
STREET 22.711909
RESIDENCE 16.049554
APARTMENT 12.715868
SIDEWALK 11.057375
OTHER 3.833117
df = pd.read_csv('Chicago_Crimes_DataSet/Chicago_Crimes_2012_to_2017.csv')
null = pd.DataFrame({'Null Values' : df.isna().sum(), 'Percentage Null Values' : (df.isna().sum()) / (df.shape[0]) * (100)})
df.fillna(0, inplace = True)
# Merge Crime Dataset with Economy DataSet
# get the economy data
poverty = pd.read_excel('Chicago_Economy_DataSet/poverty.xlsx')
median_house_income = pd.read_excel('Chicago_Economy_DataSet/median_house_income.xlsx')
black = pd.read_excel('Chicago_Economy_DataSet/black.xlsx')
naturecitizenship = pd.read_excel('Chicago_Economy_DataSet/citizenship.xlsx')
education = pd.read_excel('Chicago_Economy_DataSet/education.xlsx')
health_insurance = pd.read_excel('Chicago_Economy_DataSet/health_insurance.xlsx')
hispanic_latino = pd.read_excel('Chicago_Economy_DataSet/hispanic_latino.xlsx')
noncitizenship = pd.read_excel('Chicago_Economy_DataSet/noncitizenship.xlsx')
owner_occupied = pd.read_excel('Chicago_Economy_DataSet/owner_occupied.xlsx')
merged = pd.merge(df, poverty, on='Community Area')
merged1 = pd.merge(df, median_house_income, on='Community Area')
merged2 = pd.merge(df, black, on='Community Area')
merged3 = pd.merge(df, naturecitizenship, on='Community Area')
merged4 = pd.merge(df, health_insurance, on='Community Area')
merged5 = pd.merge(df, hispanic_latino, on='Community Area')
merged6 = pd.merge(df, noncitizenship, on='Community Area')
merged7 = pd.merge(df, owner_occupied, on='Community Area')
merged8 = pd.merge(df, education, on='Community Area')
df1 = merged
# note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so these lines require an older pandas
df1['poverty'] = merged.lookup(merged.index,merged['Year'])
df1['median_house_income'] = merged1.lookup(merged1.index,merged1['Year'])
df1['black'] = merged2.lookup(merged2.index,merged2['Year'])
df1['naturecitizenship'] = merged3.lookup(merged3.index,merged3['Year'])
df1['health_insurance'] = merged4.lookup(merged4.index,merged4['Year'])
df1['hispanic_latino'] = merged5.lookup(merged5.index,merged5['Year'])
df1['noncitizenship'] = merged6.lookup(merged6.index,merged6['Year'])
df1['owner_occupied'] = merged7.lookup(merged7.index,merged7['Year'])
df1['education'] = merged8.lookup(merged8.index,merged8['Year'])
df1.to_csv('test.csv')
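The year-wise extraction above relies on `DataFrame.lookup`, which was deprecated in pandas 1.2 and removed in 2.0. A minimal sketch of an equivalent row-wise pick using NumPy indexing, on a hypothetical toy frame (column names and values invented for illustration):

```python
import numpy as np
import pandas as pd

# toy frame shaped like `merged`: one column per year plus a Year field (hypothetical numbers)
toy = pd.DataFrame({
    2012: [0.40, 0.30],
    2013: [0.45, 0.25],
    'Year': [2012, 2013],
})

# for each row, pick the value in the column named by that row's Year
col_pos = toy.columns.get_indexer(toy['Year'])
picked = toy.to_numpy()[np.arange(len(toy)), col_pos]
# row 0 takes the 2012 column, row 1 the 2013 column
print(picked)
```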
data = pd.read_csv('test.csv')
data = data.drop(columns =['2012','2013','2014','2015','2016','2017','IUCR',"Description"],axis=1)
from sklearn.preprocessing import LabelEncoder
object_features = ["Case Number", "Date", "Block", "Primary Type",
"Location Description", "FBI Code",
"Updated On", "Location"]
for feature in object_features:
data[feature]=LabelEncoder().fit_transform(data[feature])
data.head()
| Unnamed: 0.1 | Unnamed: 0 | ID | Case Number | Date | Block | Primary Type | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location | poverty | median_house_income | black | naturecitizenship | health_insurance | hispanic_latino | noncitizenship | owner_occupied | education | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 10508693 | 1259878 | 193027 | 6516 | 2 | 18 | True | True | 1022 | 10.0 | 24.0 | 29.0 | 10 | 1154907.0 | 1893681.0 | 2016 | 305 | 41.864073 | -87.706819 | 192925 | 0.406 | 27397 | 0.873900 | 0.011372 | 0.889 | 0.088320 | 0.021061 | 0.245 | 0.1237 |
| 1 | 1 | 48110 | 10448396 | 1224287 | 113700 | 9284 | 3 | 18 | False | False | 1024 | 10.0 | 24.0 | 29.0 | 6 | 1153729.0 | 1890276.0 | 2016 | 306 | 41.854753 | -87.711234 | 184458 | 0.406 | 27397 | 0.873900 | 0.011372 | 0.889 | 0.088320 | 0.021061 | 0.245 | 0.1237 |
| 2 | 2 | 52366 | 10495378 | 1251596 | 175514 | 7416 | 32 | 135 | True | False | 1021 | 10.0 | 24.0 | 29.0 | 17 | 1154286.0 | 1892167.0 | 2016 | 306 | 41.859931 | -87.709139 | 189929 | 0.406 | 27397 | 0.873900 | 0.011372 | 0.889 | 0.088320 | 0.021061 | 0.245 | 0.1237 |
| 3 | 3 | 57943 | 20859 | 428099 | 188612 | 9233 | 10 | 16 | True | False | 1024 | 10.0 | 24.0 | 29.0 | 0 | 1153059.0 | 1890107.0 | 2013 | 306 | 41.854302 | -87.713697 | 183911 | 0.470 | 24144 | 0.890039 | 0.008618 | 0.828 | 0.070019 | 0.025740 | 0.250 | 0.1100 |
| 4 | 4 | 63024 | 10508736 | 1259927 | 194418 | 15986 | 1 | 18 | True | True | 1133 | 11.0 | 24.0 | 29.0 | 4 | 1152965.0 | 1894830.0 | 2016 | 307 | 41.867265 | -87.713917 | 196143 | 0.406 | 27397 | 0.873900 | 0.011372 | 0.889 | 0.088320 | 0.021061 | 0.245 | 0.1237 |
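`LabelEncoder` assigns each distinct label an integer according to the sorted order of the unique values. A stdlib-only sketch of the same mapping (toy labels, not the real column):

```python
# what LabelEncoder does under the hood: sort the unique values, then index into them
values = ['THEFT', 'BATTERY', 'THEFT', 'ARSON']
classes = sorted(set(values))                 # ['ARSON', 'BATTERY', 'THEFT']
codes = [classes.index(v) for v in values]
print(codes)  # [2, 1, 2, 0]
```

Note that these integer codes impose an arbitrary ordering on the categories, which tree-based models tolerate better than linear models do.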
plt.figure(figsize = (36, 20))
corr = data.corr()
sns.heatmap(corr, annot = True, linewidths = 1)
plt.show()
correlation = data.corr()['Primary Type'].abs().sort_values(ascending = False)
correlation
Primary Type            1.000000
Domestic                0.266836
education               0.143164
median_house_income     0.139249
Location Description    0.112288
poverty                 0.101108
black                   0.094331
Block                   0.082833
health_insurance        0.080434
Community Area          0.070855
Location                0.070278
Ward                    0.062362
naturecitizenship       0.046366
FBI Code                0.039151
Beat                    0.032409
District                0.032054
noncitizenship          0.031606
hispanic_latino         0.018144
Year                    0.012897
Y Coordinate            0.010279
Arrest                  0.010178
Updated On              0.010095
owner_occupied          0.009675
Case Number             0.009325
Unnamed: 0.1            0.009299
Date                    0.008076
Unnamed: 0              0.003608
Latitude                0.003370
X Coordinate            0.002514
Longitude               0.002374
ID                      0.000722
Name: Primary Type, dtype: float64
# dropping columns whose correlation with Primary Type is below 0.01
useless_col = ['owner_occupied','Case Number','Unnamed: 0.1', 'Date', 'Unnamed: 0', 'Longitude','X Coordinate', 'ID','Latitude']
data.drop(useless_col, axis = 1, inplace = True)
data.var()
Block                   9.398763e+07
Primary Type            1.411251e+02
Location Description    1.675417e+03
Arrest                  1.919809e-01
Domestic                1.282452e-01
Beat                    4.783784e+05
District                4.767494e+01
Ward                    1.906043e+02
Community Area          4.596774e+02
FBI Code                4.222913e+01
Y Coordinate            8.930387e+10
Year                    2.101281e+00
Updated On              5.181823e+04
Location                1.043733e+10
poverty                 1.296835e-02
median_house_income     5.766074e+08
black                   1.564294e-01
naturecitizenship       3.495513e-03
health_insurance        4.472351e-03
hispanic_latino         6.618987e-02
noncitizenship          7.383363e-03
education               5.702499e-02
dtype: float64
# normalizing numerical variables
from sklearn.preprocessing import MinMaxScaler
# create a scaler object
scaler = MinMaxScaler()
# select the columns to be scaled
cols_to_scale = ['Block', 'Location Description', 'Beat', 'District', 'Ward',
'Community Area', 'FBI Code', 'Y Coordinate','Updated On',
'Year', 'Location','poverty', 'median_house_income', 'black',
'naturecitizenship' ,'health_insurance' ,'hispanic_latino',
'noncitizenship', 'education']
for feature in cols_to_scale:
    data[feature] = data[feature].replace(to_replace="NaN", value=np.nan)  # np.NaN was removed in NumPy 2.0; use np.nan
    data[feature] = data[feature].fillna(data[feature].median(skipna=True))
# fit and transform the selected columns using the scaler object
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
# print the normalized dataframe
data
| Block | Primary Type | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | Y Coordinate | Year | Updated On | Location | poverty | median_house_income | black | naturecitizenship | health_insurance | hispanic_latino | noncitizenship | education | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.198944 | 2 | 0.126761 | True | True | 0.375825 | 0.322581 | 0.48 | 0.368421 | 0.40 | 0.970336 | 0.8 | 0.318372 | 0.523860 | 0.607477 | 0.110899 | 0.873688 | 0.028431 | 0.726994 | 0.095362 | 0.058532 | 0.104457 |
| 1 | 0.283455 | 3 | 0.126761 | False | False | 0.376650 | 0.322581 | 0.48 | 0.368421 | 0.24 | 0.968591 | 0.8 | 0.319415 | 0.500869 | 0.607477 | 0.110899 | 0.873688 | 0.028431 | 0.726994 | 0.095362 | 0.058532 | 0.104457 |
| 2 | 0.226422 | 32 | 0.950704 | True | False | 0.375413 | 0.322581 | 0.48 | 0.368421 | 0.68 | 0.969560 | 0.8 | 0.319415 | 0.515725 | 0.607477 | 0.110899 | 0.873688 | 0.028431 | 0.726994 | 0.095362 | 0.058532 | 0.104457 |
| 3 | 0.281898 | 10 | 0.112676 | True | False | 0.376650 | 0.322581 | 0.48 | 0.368421 | 0.00 | 0.968504 | 0.2 | 0.319415 | 0.499384 | 0.707165 | 0.081797 | 0.889854 | 0.020111 | 0.539877 | 0.075602 | 0.071535 | 0.087400 |
| 4 | 0.488077 | 1 | 0.126761 | True | True | 0.421617 | 0.354839 | 0.48 | 0.368421 | 0.16 | 0.970924 | 0.8 | 0.320459 | 0.532598 | 0.607477 | 0.110899 | 0.873688 | 0.028431 | 0.726994 | 0.095362 | 0.058532 | 0.104457 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456656 | 0.937258 | 2 | 0.859155 | False | False | 0.867162 | 0.709677 | 0.38 | 0.934211 | 0.40 | 0.000000 | 1.0 | 0.086639 | 1.000000 | 0.052960 | 0.756989 | 0.341171 | 0.070237 | 0.960123 | 0.053356 | 0.033000 | 0.635209 |
| 1456657 | 0.903062 | 8 | 0.788732 | False | False | 0.870462 | 0.709677 | 0.38 | 0.934211 | 0.52 | 0.000000 | 1.0 | 0.090814 | 1.000000 | 0.052960 | 0.756989 | 0.341171 | 0.070237 | 0.960123 | 0.053356 | 0.033000 | 0.635209 |
| 1456658 | 0.954081 | 31 | 0.901408 | False | False | 0.866749 | 0.709677 | 0.38 | 0.934211 | 0.28 | 0.000000 | 1.0 | 0.090814 | 1.000000 | 0.052960 | 0.756989 | 0.341171 | 0.070237 | 0.960123 | 0.053356 | 0.033000 | 0.635209 |
| 1456659 | 0.937410 | 8 | 0.788732 | False | False | 0.866337 | 0.709677 | 0.38 | 0.934211 | 0.52 | 0.000000 | 1.0 | 0.090814 | 1.000000 | 0.052960 | 0.756989 | 0.341171 | 0.070237 | 0.960123 | 0.053356 | 0.033000 | 0.635209 |
| 1456660 | 0.199463 | 8 | 0.161972 | False | False | 0.866749 | 0.709677 | 0.42 | 0.934211 | 0.52 | 0.000000 | 1.0 | 0.090814 | 1.000000 | 0.052960 | 0.756989 | 0.341171 | 0.070237 | 0.960123 | 0.053356 | 0.033000 | 0.635209 |
1456661 rows × 22 columns
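The Year column above illustrates what `MinMaxScaler` does: each column is mapped to [0, 1] via (x − min) / (max − min), so 2016 in the 2012–2017 range becomes 0.8. A stdlib-only check of the formula:

```python
# min-max scaling maps a column onto [0, 1]: (x - min) / (max - min)
col = [2012, 2013, 2014, 2015, 2016, 2017]
lo, hi = min(col), max(col)
scaled = [(x - lo) / (hi - lo) for x in col]
print(scaled)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```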
data.var()
Block                     0.087613
Primary Type            141.125095
Location Description      0.083090
Arrest                    0.191981
Domestic                  0.128245
Beat                      0.081415
District                  0.049610
Ward                      0.076242
Community Area            0.079584
FBI Code                  0.067567
Y Coordinate              0.023448
Year                      0.084051
Updated On                0.056461
Location                  0.076956
poverty                   0.031454
median_house_income       0.046150
black                     0.156956
naturecitizenship         0.031889
health_insurance          0.042082
hispanic_latino           0.077165
noncitizenship            0.057027
education                 0.088393
dtype: float64
y = data ['Primary Type']
X = data.drop(['Primary Type'], axis=1)
X.shape,y.shape
((1456661, 21), (1456661,))
# splitting data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
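`train_test_split` as called here shuffles without stratification; passing `stratify=y` keeps each class's share equal in train and test, which matters for the rare crime types in this data. A stdlib-only sketch of what stratification does (toy data; `stratified_split` is a hypothetical helper, not a library function):

```python
import random
from collections import defaultdict

def stratified_split(X, y, test_size=0.25, seed=0):
    """Split indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label, idx in by_class.items():
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_size))  # per-class test quota
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return train_idx, test_idx

X = list(range(20))
y = [0] * 12 + [1] * 8            # imbalanced toy labels
train_idx, test_idx = stratified_split(X, y)
print(len(test_idx))               # 5 = 25% of 20 (3 of class 0, 2 of class 1)
```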
# Fit the logistic regression model (increase max_iter if the solver warns it failed to converge)
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred_lr = lr.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred_lr)
conf = confusion_matrix(y_test, y_pred_lr)
clf_report = classification_report(y_test, y_pred_lr)
print(f"Accuracy Score of Logistic Regression is : {acc_lr}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of Logistic Regression is : 0.7030969393078981
Confusion Matrix :
[[ 0 0 463 ... 0 0 0]
[ 0 142 11683 ... 0 8111 0]
[ 0 21 54157 ... 0 9860 0]
...
[ 0 0 110 ... 0 56 0]
[ 0 8 2887 ... 0 78180 0]
[ 0 0 0 ... 0 0 6]]
Classification Report :
precision recall f1-score support
0 0.00 0.00 0.00 559
1 0.70 0.01 0.01 22870
2 0.74 0.82 0.78 66144
3 0.54 0.24 0.34 21149
4 0.00 0.00 0.00 10
5 0.00 0.00 0.00 1748
6 0.83 0.93 0.88 38801
7 0.56 0.55 0.55 9173
8 0.89 0.60 0.72 18869
9 0.00 0.00 0.00 544
10 0.00 0.00 0.00 661
11 0.00 0.00 0.00 7
12 0.00 0.00 0.00 1556
13 0.00 0.00 0.00 164
14 0.00 0.00 0.00 272
15 0.00 0.00 0.00 491
16 0.06 0.00 0.00 15240
17 0.73 0.96 0.83 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
21 0.00 0.00 0.00 57
22 0.92 0.21 0.34 2803
23 0.00 0.00 0.00 7
24 0.75 0.90 0.82 21830
25 0.00 0.00 0.00 1880
26 0.00 0.00 0.00 14
27 0.08 0.00 0.00 3261
28 0.76 0.90 0.82 14425
29 0.00 0.00 0.00 1255
30 0.00 0.00 0.00 198
31 0.62 0.95 0.75 82055
32 0.02 0.00 0.00 4301
accuracy 0.70 364166
macro avg 0.26 0.22 0.21 364166
weighted avg 0.65 0.70 0.64 364166
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
acc_knn = accuracy_score(y_test, y_pred_knn)
conf = confusion_matrix(y_test, y_pred_knn)
clf_report = classification_report(y_test, y_pred_knn)
print(f"Accuracy Score of KNN is : {acc_knn}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of KNN is : 0.8188463502908014
Confusion Matrix :
[[ 26 25 456 ... 0 8 0]
[ 7 10645 7980 ... 1 1798 0]
[ 31 3872 57851 ... 0 2168 1]
...
[ 0 75 75 ... 0 12 0]
[ 1 1267 1591 ... 0 75682 0]
[ 0 0 10 ... 0 0 2224]]
Classification Report :
precision recall f1-score support
0 0.29 0.05 0.08 559
1 0.58 0.47 0.52 22870
2 0.80 0.87 0.84 66144
3 0.77 0.77 0.77 21149
4 0.22 0.20 0.21 10
5 0.46 0.13 0.20 1748
6 0.92 0.94 0.93 38801
7 0.60 0.60 0.60 9173
8 0.92 0.84 0.88 18869
9 0.93 0.41 0.57 544
10 0.94 0.37 0.53 661
11 0.00 0.00 0.00 7
12 0.35 0.24 0.29 1556
13 0.00 0.00 0.00 164
14 0.43 0.04 0.07 272
15 0.52 0.23 0.32 491
16 0.82 0.68 0.74 15240
17 0.92 0.97 0.94 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
21 0.00 0.00 0.00 57
22 0.54 0.19 0.28 2803
23 0.00 0.00 0.00 7
24 0.73 0.86 0.79 21830
25 0.88 0.85 0.86 1880
26 0.00 0.00 0.00 14
27 0.53 0.33 0.41 3261
28 0.82 0.79 0.80 14425
29 0.62 0.10 0.17 1255
30 0.00 0.00 0.00 198
31 0.86 0.92 0.89 82055
32 0.72 0.52 0.60 4301
accuracy 0.82 364166
macro avg 0.51 0.39 0.42 364166
weighted avg 0.81 0.82 0.81 364166
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
acc_dtc = accuracy_score(y_test, y_pred_dtc)
conf = confusion_matrix(y_test, y_pred_dtc)
clf_report = classification_report(y_test, y_pred_dtc)
print(f"Accuracy Score of Decision Tree is : {acc_dtc}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of Decision Tree is : 0.964812750229291
Confusion Matrix :
[[ 557 0 0 ... 0 0 0]
[ 0 22647 0 ... 223 0 0]
[ 0 0 66144 ... 0 0 0]
...
[ 0 161 0 ... 10 0 0]
[ 0 0 0 ... 0 82055 0]
[ 0 0 0 ... 0 0 4261]]
Classification Report :
precision recall f1-score support
0 0.99 1.00 0.99 559
1 0.99 0.99 0.99 22870
2 1.00 1.00 1.00 66144
3 1.00 1.00 1.00 21149
4 0.07 0.30 0.11 10
5 0.93 0.92 0.93 1748
6 1.00 1.00 1.00 38801
7 0.64 0.66 0.65 9173
8 1.00 1.00 1.00 18869
9 1.00 0.99 1.00 544
10 1.00 1.00 1.00 661
11 0.00 0.00 0.00 7
12 0.44 0.44 0.44 1556
13 0.04 0.04 0.04 164
14 0.28 0.29 0.29 272
15 1.00 1.00 1.00 491
16 1.00 1.00 1.00 15240
17 0.98 0.98 0.98 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
21 0.19 0.21 0.20 57
22 0.60 0.65 0.63 2803
23 0.00 0.00 0.00 7
24 0.81 0.79 0.80 21830
25 1.00 1.00 1.00 1880
26 0.00 0.00 0.00 14
27 0.61 0.60 0.60 3261
28 1.00 1.00 1.00 14425
29 0.87 0.83 0.85 1255
30 0.04 0.05 0.04 198
31 1.00 1.00 1.00 82055
32 1.00 0.99 0.99 4301
accuracy 0.96 364166
macro avg 0.64 0.65 0.64 364166
weighted avg 0.97 0.96 0.97 364166
rd_clf = RandomForestClassifier()
rd_clf.fit(X_train, y_train)
y_pred_rd_clf = rd_clf.predict(X_test)
acc_rd_clf = accuracy_score(y_test, y_pred_rd_clf)
conf = confusion_matrix(y_test, y_pred_rd_clf)
clf_report = classification_report(y_test, y_pred_rd_clf)
print(f"Accuracy Score of Random Forest is : {acc_rd_clf}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of Random Forest is : 0.9667459345463333
Confusion Matrix :
[[ 428 0 94 ... 0 0 0]
[ 0 22792 39 ... 10 1 0]
[ 0 33 66094 ... 0 5 0]
...
[ 0 169 0 ... 2 0 0]
[ 0 0 6 ... 0 82049 0]
[ 0 0 0 ... 0 0 4181]]
Classification Report :
precision recall f1-score support
0 0.99 0.77 0.86 559
1 0.98 1.00 0.99 22870
2 1.00 1.00 1.00 66144
3 1.00 1.00 1.00 21149
4 0.25 0.20 0.22 10
5 0.89 0.73 0.80 1748
6 0.99 1.00 1.00 38801
7 0.74 0.67 0.70 9173
8 1.00 1.00 1.00 18869
9 0.96 0.94 0.95 544
10 0.98 0.81 0.89 661
11 0.00 0.00 0.00 7
12 0.49 0.43 0.46 1556
13 0.17 0.04 0.07 164
14 0.33 0.11 0.16 272
15 0.90 0.78 0.83 491
16 1.00 1.00 1.00 15240
17 0.98 0.98 0.98 33789
18 0.00 0.00 0.00 7
19 0.33 0.04 0.07 26
21 0.33 0.05 0.09 57
22 0.77 0.58 0.66 2803
23 0.00 0.00 0.00 7
24 0.81 0.89 0.85 21830
25 0.98 0.97 0.97 1880
26 0.00 0.00 0.00 14
27 0.64 0.59 0.62 3261
28 0.98 0.99 0.98 14425
29 0.81 0.70 0.75 1255
30 0.11 0.01 0.02 198
31 1.00 1.00 1.00 82055
32 0.96 0.97 0.97 4301
accuracy 0.97 364166
macro avg 0.67 0.60 0.62 364166
weighted avg 0.97 0.97 0.97 364166
ada = AdaBoostClassifier(base_estimator = dtc)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
acc_ada = accuracy_score(y_test, y_pred_ada)
conf = confusion_matrix(y_test, y_pred_ada)
clf_report = classification_report(y_test, y_pred_ada)
print(f"Accuracy Score of Ada Boost Classifier is : {acc_ada}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of Ada Boost Classifier is : 0.971883701388927
Confusion Matrix :
[[ 557 0 0 ... 0 0 0]
[ 0 22857 0 ... 13 0 0]
[ 0 0 66144 ... 0 0 0]
...
[ 0 169 0 ... 2 0 0]
[ 0 0 0 ... 0 82055 0]
[ 0 0 0 ... 0 0 4295]]
Classification Report :
precision recall f1-score support
0 1.00 1.00 1.00 559
1 0.99 1.00 1.00 22870
2 1.00 1.00 1.00 66144
3 1.00 1.00 1.00 21149
4 0.33 0.30 0.32 10
5 0.92 0.98 0.95 1748
6 1.00 1.00 1.00 38801
7 0.72 0.65 0.68 9173
8 1.00 1.00 1.00 18869
9 1.00 0.99 1.00 544
10 1.00 1.00 1.00 661
11 0.00 0.00 0.00 7
12 0.49 0.43 0.46 1556
13 0.15 0.05 0.07 164
14 0.52 0.25 0.34 272
15 1.00 1.00 1.00 491
16 1.00 1.00 1.00 15240
17 0.99 0.98 0.98 33789
18 0.00 0.00 0.00 7
19 0.25 0.04 0.07 26
21 0.33 0.12 0.18 57
22 0.80 0.63 0.71 2803
23 0.00 0.00 0.00 7
24 0.81 0.89 0.85 21830
25 1.00 1.00 1.00 1880
26 0.00 0.00 0.00 14
27 0.67 0.65 0.66 3261
28 1.00 1.00 1.00 14425
29 0.86 0.95 0.90 1255
30 0.09 0.01 0.02 198
31 1.00 1.00 1.00 82055
32 1.00 1.00 1.00 4301
accuracy 0.97 364166
macro avg 0.69 0.65 0.66 364166
weighted avg 0.97 0.97 0.97 364166
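Note that scikit-learn 1.2 renamed AdaBoost's `base_estimator` argument to `estimator`, and the old name was later removed, so the cell above only runs on older versions. A version-agnostic construction, sketched under the assumption that `sklearn.__version__` starts with two numeric components:

```python
import sklearn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Pick the keyword that matches the installed scikit-learn version.
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
kw = "estimator" if (major, minor) >= (1, 2) else "base_estimator"
ada = AdaBoostClassifier(**{kw: DecisionTreeClassifier()})
```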
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
acc_gb = accuracy_score(y_test, y_pred_gb)
conf = confusion_matrix(y_test, y_pred_gb)
clf_report = classification_report(y_test, y_pred_gb)
print(f"Accuracy Score of Gradient Boosting Classifier is : {acc_gb}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of Gradient Boosting Classifier is : 0.9123970936331234
Confusion Matrix :
[[ 534 0 4 ... 0 0 0]
[ 15 22126 483 ... 2 21 0]
[ 52 255 64627 ... 11 40 0]
...
[ 0 153 11 ... 0 0 0]
[ 38 726 2733 ... 0 77798 0]
[ 1 9 2 ... 0 0 4205]]
Classification Report :
precision recall f1-score support
0 0.62 0.96 0.75 559
1 0.91 0.97 0.94 22870
2 0.93 0.98 0.95 66144
3 1.00 0.95 0.97 21149
4 0.02 0.10 0.03 10
5 0.00 0.00 0.00 1748
6 1.00 0.95 0.97 38801
7 0.68 0.56 0.61 9173
8 0.99 0.94 0.97 18869
9 1.00 0.70 0.83 544
10 0.00 0.00 0.00 661
11 0.00 0.14 0.00 7
12 0.00 0.00 0.00 1556
13 0.01 0.01 0.01 164
14 0.01 0.01 0.01 272
15 1.00 0.42 0.60 491
16 1.00 0.95 0.98 15240
17 0.99 0.92 0.95 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
20 0.00 0.00 0.00 0
21 0.01 0.14 0.01 57
22 0.07 0.06 0.06 2803
23 0.00 0.29 0.00 7
24 0.68 0.87 0.77 21830
25 1.00 0.98 0.99 1880
26 0.00 0.00 0.00 14
27 0.59 0.49 0.54 3261
28 0.97 0.97 0.97 14425
29 0.81 0.24 0.37 1255
30 0.00 0.00 0.00 198
31 1.00 0.95 0.97 82055
32 1.00 0.98 0.99 4301
accuracy 0.91 364166
macro avg 0.49 0.47 0.46 364166
weighted avg 0.93 0.91 0.92 364166
xgb = XGBClassifier(booster = 'gbtree', learning_rate = 0.1, max_depth = 5, n_estimators = 180)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
acc_xgb = accuracy_score(y_test, y_pred_xgb)
conf = confusion_matrix(y_test, y_pred_xgb)
clf_report = classification_report(y_test, y_pred_xgb)
print(f"Accuracy Score of XGBoost Classifier is : {acc_xgb}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of XGBoost Classifier is : 0.9756869120126536
Confusion Matrix :
[[ 557 0 0 ... 0 0 0]
[ 0 22870 0 ... 0 0 0]
[ 0 0 66144 ... 0 0 0]
...
[ 0 170 0 ... 0 0 0]
[ 0 0 0 ... 0 82055 0]
[ 0 0 0 ... 0 0 4299]]
Classification Report :
precision recall f1-score support
0 1.00 1.00 1.00 559
1 0.99 1.00 1.00 22870
2 1.00 1.00 1.00 66144
3 1.00 1.00 1.00 21149
4 0.50 0.20 0.29 10
5 0.92 1.00 0.96 1748
6 1.00 1.00 1.00 38801
7 0.78 0.69 0.73 9173
8 1.00 1.00 1.00 18869
9 1.00 1.00 1.00 544
10 1.00 1.00 1.00 661
11 0.00 0.00 0.00 7
12 0.60 0.37 0.46 1556
13 0.00 0.00 0.00 164
14 0.78 0.29 0.42 272
15 1.00 1.00 1.00 491
16 1.00 1.00 1.00 15240
17 0.99 0.97 0.98 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
21 0.67 0.11 0.18 57
22 0.97 0.58 0.73 2803
23 0.00 0.00 0.00 7
24 0.81 0.94 0.87 21830
25 1.00 1.00 1.00 1880
26 0.00 0.00 0.00 14
27 0.72 0.73 0.72 3261
28 1.00 1.00 1.00 14425
29 0.85 0.99 0.92 1255
30 0.00 0.00 0.00 198
31 1.00 1.00 1.00 82055
32 1.00 1.00 1.00 4301
accuracy 0.98 364166
macro avg 0.71 0.65 0.66 364166
weighted avg 0.97 0.98 0.97 364166
cat = CatBoostClassifier(iterations=100)
cat.fit(X_train, y_train)
y_pred_cat = cat.predict(X_test)
acc_cat = accuracy_score(y_test, y_pred_cat)
conf = confusion_matrix(y_test, y_pred_cat)
clf_report = classification_report(y_test, y_pred_cat)
Learning rate set to 0.5
0:	learn: 1.0014762	total: 3.52s	remaining: 5m 48s
1:	learn: 4.4842977	total: 6.12s	remaining: 4m 59s
...
98:	learn: 106.1429995	total: 4m 38s	remaining: 2.81s
99:	learn: 105.8996912	total: 4m 41s	remaining: 0us
print(f"Accuracy Score of CatBoost Classifier is : {acc_cat}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of CatBoost Classifier is : 0.9560145647863886
Confusion Matrix :
[[ 0 0 557 ... 0 0 0]
[ 1426 20535 356 ... 0 0 0]
[ 59 153 65932 ... 0 0 0]
...
[ 1 163 0 ... 0 0 0]
[ 0 0 0 ... 0 82055 0]
[ 0 0 0 ... 0 0 4298]]
Classification Report :
precision recall f1-score support
0 0.00 0.00 0.00 559
1 0.93 0.90 0.91 22870
2 0.98 1.00 0.99 66144
3 1.00 1.00 1.00 21149
4 0.00 0.00 0.00 10
5 0.93 0.79 0.85 1748
6 1.00 1.00 1.00 38801
7 0.74 0.62 0.67 9173
8 0.99 0.99 0.99 18869
9 1.00 0.99 1.00 544
10 1.00 0.72 0.84 661
11 0.00 0.00 0.00 7
12 0.36 0.15 0.21 1556
13 0.00 0.01 0.00 164
14 0.71 0.29 0.41 272
15 1.00 0.98 0.99 491
16 0.96 0.93 0.94 15240
17 0.99 0.95 0.97 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
21 0.00 0.00 0.00 57
22 0.86 0.54 0.66 2803
23 0.00 0.00 0.00 7
24 0.78 0.93 0.85 21830
25 1.00 1.00 1.00 1880
26 0.00 0.00 0.00 14
27 0.64 0.70 0.67 3261
28 1.00 1.00 1.00 14425
29 0.81 0.86 0.84 1255
30 0.00 0.00 0.00 198
31 1.00 1.00 1.00 82055
32 1.00 1.00 1.00 4301
accuracy 0.96 364166
macro avg 0.61 0.57 0.59 364166
weighted avg 0.96 0.96 0.96 364166
etc = ExtraTreesClassifier()
etc.fit(X_train, y_train)
y_pred_etc = etc.predict(X_test)
acc_etc = accuracy_score(y_test, y_pred_etc)
conf = confusion_matrix(y_test, y_pred_etc)
clf_report = classification_report(y_test, y_pred_etc)
print(f"Accuracy Score of Extra Trees Classifier is : {acc_etc}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of Extra Trees Classifier is : 0.9620337977735428
Confusion Matrix :
[[ 358 2 164 ... 0 2 0]
[ 0 22252 467 ... 38 48 0]
[ 2 113 65913 ... 0 52 0]
...
[ 0 153 9 ... 6 2 0]
[ 0 30 76 ... 1 81909 0]
[ 0 0 0 ... 0 0 4086]]
Classification Report :
precision recall f1-score support
0 0.99 0.64 0.78 559
1 0.98 0.97 0.98 22870
2 0.99 1.00 0.99 66144
3 1.00 0.98 0.99 21149
4 0.17 0.20 0.18 10
5 0.92 0.93 0.92 1748
6 0.99 1.00 1.00 38801
7 0.69 0.63 0.66 9173
8 1.00 0.99 1.00 18869
9 0.99 0.88 0.93 544
10 1.00 0.99 1.00 661
11 0.00 0.00 0.00 7
12 0.48 0.43 0.45 1556
13 0.09 0.04 0.06 164
14 0.33 0.15 0.21 272
15 0.99 0.96 0.98 491
16 1.00 0.97 0.99 15240
17 0.97 0.98 0.98 33789
18 0.00 0.00 0.00 7
19 0.25 0.08 0.12 26
21 0.25 0.05 0.09 57
22 0.73 0.63 0.67 2803
23 0.00 0.00 0.00 7
24 0.80 0.86 0.83 21830
25 1.00 0.97 0.98 1880
26 0.00 0.00 0.00 14
27 0.66 0.62 0.64 3261
28 0.99 1.00 0.99 14425
29 0.86 0.78 0.82 1255
30 0.11 0.03 0.05 198
31 0.99 1.00 1.00 82055
32 0.99 0.95 0.97 4301
accuracy 0.96 364166
macro avg 0.66 0.62 0.63 364166
weighted avg 0.96 0.96 0.96 364166
lgbm = LGBMClassifier(learning_rate = 1)
lgbm.fit(X_train, y_train)
y_pred_lgbm = lgbm.predict(X_test)
acc_lgbm = accuracy_score(y_test, y_pred_lgbm)
conf = confusion_matrix(y_test, y_pred_lgbm)
clf_report = classification_report(y_test, y_pred_lgbm)
print(f"Accuracy Score of LGBM Classifier is : {acc_lgbm}")
print(f"Confusion Matrix : \n{conf}")
print(f"Classification Report : \n{clf_report}")
Accuracy Score of LGBM Classifier is : 0.1504039366662456
Confusion Matrix :
[[ 0 0 189 ... 0 241 0]
[ 0 0 9290 ... 0 11334 0]
[ 0 0 27810 ... 0 33496 0]
...
[ 0 0 96 ... 0 82 0]
[ 0 0 40802 ... 0 25680 0]
[ 0 0 1127 ... 0 2659 0]]
Classification Report :
precision recall f1-score support
0 0.00 0.00 0.00 559
1 0.00 0.00 0.00 22870
2 0.18 0.42 0.25 66144
3 0.00 0.00 0.00 21149
4 0.00 0.00 0.00 10
5 0.00 0.00 0.00 1748
6 0.00 0.00 0.00 38801
7 0.01 0.00 0.00 9173
8 0.02 0.05 0.03 18869
9 0.00 0.00 0.00 544
10 0.00 0.00 0.00 661
11 0.00 0.00 0.00 7
12 0.00 0.00 0.00 1556
13 0.00 0.00 0.00 164
14 0.00 0.00 0.00 272
15 0.00 0.00 0.00 491
16 0.54 0.00 0.00 15240
17 0.53 0.01 0.01 33789
18 0.00 0.00 0.00 7
19 0.00 0.00 0.00 26
21 0.00 0.00 0.00 57
22 0.00 0.00 0.00 2803
23 0.00 0.00 0.00 7
24 1.00 0.00 0.00 21830
25 0.00 0.00 0.00 1880
26 0.00 0.00 0.00 14
27 0.01 0.00 0.00 3261
28 0.46 0.00 0.01 14425
29 0.00 0.00 0.00 1255
30 0.00 0.00 0.00 198
31 0.16 0.31 0.21 82055
32 0.00 0.00 0.00 4301
accuracy 0.15 364166
macro avg 0.09 0.02 0.02 364166
weighted avg 0.22 0.15 0.10 364166
models = pd.DataFrame({
'Model' : ['Logistic Regression', 'KNN', 'Decision Tree Classifier', 'Random Forest Classifier','Ada Boost Classifier',
'Gradient Boosting Classifier', 'XgBoost', 'Cat Boost', 'Extra Trees Classifier', 'LGBM'],
'Score' : [acc_lr, acc_knn, acc_dtc, acc_rd_clf, acc_ada, acc_gb, acc_xgb, acc_cat, acc_etc, acc_lgbm]
})
models.sort_values(by = 'Score', ascending = False)
| | Model | Score |
|---|---|---|
| 6 | XgBoost | 0.975687 |
| 4 | Ada Boost Classifier | 0.971884 |
| 3 | Random Forest Classifier | 0.966746 |
| 2 | Decision Tree Classifier | 0.964813 |
| 8 | Extra Trees Classifier | 0.962034 |
| 7 | Cat Boost | 0.956015 |
| 5 | Gradient Boosting Classifier | 0.912397 |
| 1 | KNN | 0.818846 |
| 0 | Logistic Regression | 0.703097 |
| 9 | LGBM | 0.150404 |
px.bar(data_frame = models.sort_values(by = 'Score', ascending = False),
x = 'Score', y = 'Model', color = 'Score', template = 'plotly_dark', title = 'Models Comparison')
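`VotingClassifier` is imported at the top of the notebook but never used; ensembling the strongest models above is a natural follow-up. A sketch on synthetic data (hypothetical toy set; substitute the notebook's split and the tuned models to extend the real comparison):

```python
from sklearn.ensemble import (VotingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic multiclass data standing in for the crime features.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=6,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft voting averages predicted class probabilities across estimators,
# which usually beats hard majority voting when the models are well calibrated.
vote = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('et', ExtraTreesClassifier(random_state=0)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='soft')
vote.fit(X_train, y_train)
acc_vote = accuracy_score(y_test, vote.predict(X_test))
```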
The best model, XGBoost, reached an accuracy of about 97.6%, which is quite impressive, although its macro-averaged F1 of 0.66 shows that the rarest crime classes are still predicted poorly.